
Non-robust feature


Adversarial Examples are not Bugs, they are Features

Neural Information Processing Systems

We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets.
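To make this concrete, below is a minimal sketch (in PyTorch) of the kind of construction used to exhibit non-robust features: each image is perturbed within an L2 ball toward a randomly chosen target label and then relabeled with that target. The function name, hyperparameters, and projection details are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def make_nonrobust_example(model, x, num_classes=10, eps=0.5, step=0.1, iters=100):
    """Perturb a batch of images x so a pretrained classifier predicts a
    random target label t, then relabel the perturbed images as t. If a
    fresh model trained on such (x_adv, t) pairs generalizes to the clean
    test set, the perturbations must carry predictive (non-robust) features."""
    t = torch.randint(num_classes, (x.size(0),), device=x.device)
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), t)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Gradient *descent* on the targeted loss (move toward class t),
            # using an L2-normalized step for a batch of NCHW images.
            grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
            x_adv = x_adv - step * grad / grad_norm
            # Project back into the eps-ball around the original image.
            delta = x_adv - x
            norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = delta * (eps / (norms + 1e-12)).clamp(max=1.0)
            x_adv = (x + delta).clamp(0.0, 1.0)
    return x_adv.detach(), t
```

A classifier trained from scratch on such relabeled pairs that still attains nontrivial accuracy on the original, unmodified test set indicates the perturbations carry genuinely predictive, non-robust features.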

Neural Information Processing Systems

Tiny-ImageNet is a small subset of the ImageNet dataset, containing 100,000 training images, 10,000 validation images, and 10,000 test images across 200 classes, each of dimension 64×64 pixels. Here, an approximate feature probability q(Z) is introduced to approximate the true feature probability p(Z). The additional results are illustrated in Figure 1. We provide additional feature visualizations under various adversarial attack methods, including NRF, in Figures 1-5 (CIFAR-10, SVHN, and Tiny-ImageNet are utilized). Moreover, the distilled features still contain both the robust and the brittle information, even in the failed attack examples.
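For context, the standard way such a q(Z) enters a variational IB bound (stated here as background, not as this paper's exact derivation) is as a tractable stand-in for the intractable marginal p(Z) = \mathbb{E}_{p(x)}[p(Z \mid x)], which upper-bounds the compression term:

\[
I(Z;X) \;=\; \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,p(z)\big)\right] \;\le\; \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,q(z)\big)\right],
\]

where the gap is exactly \mathrm{KL}(p(z)\,\|\,q(z)) \ge 0, so any choice of q(Z) gives a valid upper bound, and the bound tightens as q(Z) approaches p(Z).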



Neural Information Processing Systems

To clarify where adversarial brittleness truly comes from, we need to figure out how the robust and non-robust features on the data manifold subtly manipulate the feature representation and fool the model's prediction, by handling them directly in the feature space. To address this, we propose a way to precisely distill intermediate features into robust and non-robust features by employing the Information Bottleneck (IB) [17, 18, 19].
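A minimal sketch of inserting such a bottleneck at an intermediate layer (assuming a Gaussian variational posterior and a standard-normal q(z), the usual variational-IB recipe; the module and loss below are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureIB(nn.Module):
    """Information bottleneck on an intermediate feature vector f:
    z = mu(f) + sigma(f) * eps, trained so that z keeps only the
    information in f that is needed to predict the label."""
    def __init__(self, feat_dim, z_dim, num_classes):
        super().__init__()
        self.mu = nn.Linear(feat_dim, z_dim)
        self.log_var = nn.Linear(feat_dim, z_dim)
        self.head = nn.Linear(z_dim, num_classes)

    def forward(self, f):
        mu, log_var = self.mu(f), self.log_var(f)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
        return self.head(z), mu, log_var

def ib_loss(logits, y, mu, log_var, beta=1e-3):
    # I(Z;Y) side: cross-entropy is a variational lower bound on label information.
    ce = F.cross_entropy(logits, y)
    # I(Z;X) side: KL(N(mu, sigma^2) || N(0, I)), available in closed form.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1).mean()
    return ce + beta * kl
```

Sweeping beta controls how much feature information survives the bottleneck; comparing what survives on clean versus adversarially perturbed inputs is one way to separate the robust from the brittle feature components.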